In this submission I improve upon a baseline of PR #800 (bpb 0.5654) by introducing Hammerstein-Wiener Neural ODEs (HWNODEs).
What is HWNODE?
HWNODE is a novel weight-shared continuous-depth replacement for MLPs that can be more parameter-efficient than MLPs at small parameter counts, and may also permit dynamic compute at larger scales. It borrows the Hammerstein-Wiener structure from control theory: a linear ODE sandwiched between two nonlinearities, which together behave like a nonlinear ODE. This is useful because the linear core admits a closed-form matrix-exponential solution, which we approximate with a low-order Taylor truncation instead of an iterative ODE solver. By repeatedly applying the same HW block across different virtual depth states, we obtain multiple distinct nonlinear layers from one shared parameter set. To keep this stable (||A|| <= 1), we apply spectral normalization.
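The structure above can be sketched in a few lines of numpy. This is an illustrative forward pass only, not the PR's actual implementation: the state dimension, activation choice, step size, and Taylor order are all assumptions made for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # state dimension (assumed)
A = rng.standard_normal((d, d)) / np.sqrt(d)

# Spectral normalization: rescale A so its largest singular value is <= 1,
# which bounds the linear flow and keeps repeated application stable.
A = A / max(1.0, np.linalg.norm(A, 2))

def taylor_expm(A, dt, order=2):
    """Low-order Taylor truncation of exp(A*dt): I + A*dt + (A*dt)^2/2! + ..."""
    M = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, order + 1):
        term = term @ (A * dt) / k
        M = M + term
    return M

def hw_block(x, A, dt=1.0, order=2):
    """Hammerstein-Wiener block: nonlinearity -> linear ODE flow -> nonlinearity."""
    h = np.tanh(x)                      # Hammerstein (input) nonlinearity
    h = taylor_expm(A, dt, order) @ h   # linear ODE core, solved in closed form
    return np.tanh(h)                   # Wiener (output) nonlinearity

x = rng.standard_normal(d)
y = x
for _ in range(4):                      # virtual depth: reuse the same block 4 times
    y = hw_block(y, A)
```

Note that the loop at the end reuses the single `(A, dt)` parameter set for every virtual layer; only the state changes between steps.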
How did I get here?
Looking at the state of the parameter golf records within the first few days, the largest improvements seemed to come from managing to add another layer without significantly reducing width. This matches the common wisdom that depth beats width, especially at this scale. I also did not believe the MLPs were maximally parameter-efficient, since you can quantize the weights to half the size without breaking them or losing nearly that much information or reasoning ability. Given this, I began looking for a way to loop through the same weights while avoiding credit-assignment problems or performance issues.

This is a hard problem, but there were two main ways to think about it. First, you can treat it as an equilibrium problem, leading to DEQs (deep equilibrium models), which apply one layer repeatedly until convergence. Because of that repetition, DEQs are too slow to train and run for this challenge. Alternatively, you can learn a differential equation, which allows modeling complex dynamics with respect to another variable. This is a neural ODE (Ordinary Differential Equation), and here we model the dynamics with respect to depth. To make this practical, we approximate exp(AΔt) with a Taylor polynomial and place nonlinear maps around the linear ODE core at each shared depth step.

While this lets us effectively model depth and generate virtual layers, it has two remaining issues. First, and most glaring, speed: a generic nonlinear ODE would require iterative numerical integration, which is too expensive for this setting. To resolve this, we take a page out of control theory (Hammerstein-Wiener models) and make the ODE linear (thus efficiently solvable) while wrapping it in nonlinearities. Second, instability: if the neural ODE diverges, the gradients can explode or behave unpredictably. This is easy to resolve with spectral normalization, which keeps ||A|| <= 1, bounding the operator norm of the linear dynamics and greatly improving stability.
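The claim that a low-order Taylor truncation suffices for the linear core can be checked numerically. The sketch below (with an assumed step size and a spectrally normalized matrix, as described above) compares an order-2 truncation against a high-order series used as a near-exact reference for exp(AΔt).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
A = rng.standard_normal((d, d))
A = A / np.linalg.norm(A, 2)          # spectral norm exactly 1, as in the stability fix

def taylor_expm(M, order):
    """Taylor series for exp(M) truncated at the given order."""
    out = np.eye(d)
    term = np.eye(d)
    for k in range(1, order + 1):
        term = term @ M / k
        out = out + term
    return out

dt = 0.25                              # assumed step size; smaller dt => smaller error
exact = taylor_expm(A * dt, 30)        # order-30 series as a near-exact reference
approx = taylor_expm(A * dt, 2)        # the cheap order-2 truncation used per step
err = np.linalg.norm(exact - approx, 2)
```

Because ||A·Δt|| <= 0.25 here, the truncation error is bounded by the series tail (roughly 0.25³/3! ≈ 0.003), which is why the closed-form step can replace an iterative solver.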
This architecture generates virtual depth by reusing the same Hammerstein-Wiener block across repeated shared depth steps, while the Taylor expansion approximates the linear flow inside each step. This results in a performant and theoretically unbounded parameter-sharing architecture.
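The parameter-sharing benefit can be made concrete with a back-of-the-envelope count. The widths and depth below are assumptions for illustration (bias terms included), not the PR's exact configurations.

```python
# One shared HW block reused for k virtual layers vs. k distinct dense layers.
d = 384                        # hidden width (assumed)
k = 6                          # virtual depth: number of shared steps (assumed)

hw_params = d * d + d          # one shared A matrix plus bias, reused k times
mlp_params = k * (d * d + d)   # k distinct dense layers of the same width

print(hw_params, mlp_params)
```

At any fixed width, the shared block spends 1/k of the stacked MLP's parameters on its depth dimension, which is the sense in which virtual depth is "theoretically unbounded": k can grow without adding parameters.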
Data
When testing HWNODE against an MLP baseline, I performed both LM testing via parameter golf and RL experiments. In RL, HWNODE looks compelling as a parameter-efficient policy network, because we can create virtual depth at the same parameter count, which at these scales matters far more than width (and we do not sacrifice width either). As a language model, however, HWNODE is unable to beat the strongest MLPs at the same scale. HWNODE degrades much less aggressively under quantization, though, so it wins once models are compressed.
Reinforcement Learning: LunarLander-v3 (PPO, 500K steps, 3 seeds)
This experiment measures the final mean reward over the last 100 episodes of training; the task counts as solved above a score of 200. The narrow MLP remains the top performer, but fixed-Taylor HWNODE is more parameter-efficient. In this test, a 6.3k-parameter HWNODE solves the task across all seeds, and the scaled HWNODE remains competitive with much larger MLP baselines.
[Results table: per-model mean reward for mlp-narrow, mlp-medium, mlp-large, hwnode-standard, hwnode-scaled, taylor-learned, chebyshev-learned, cheb-ortho-init, cheb-ortho-param, chebyshev-scaled]

The most important result here is that fixed-Taylor HWNODE remains competitive under heavy compression. hwnode-standard solves LunarLander at only 6.3k parameters, while hwnode-scaled approaches the performance of much larger MLPs. At the same time, the best absolute mean reward still belongs to mlp-narrow, so the RL evidence supports HWNODE as parameter-efficient and competitive rather than universally superior.
Parameter Golf: HWNODE vs Parameter-Matched MLP (RX 6800 XT, 10-minute proxy, lower is better)
These tests compare HWNODE against an MLP of roughly similar parameter count. Because sliding-window evaluation was too slow to run consistently on this hardware, the most useful final comparison metric here is final_int6_roundtrip_exact.

[Results table: final_int6_roundtrip_exact for MLP_MULT=1.0; HWNODE_STATE_DIM=384, ORDER=2, VDEPTH=2; and HWNODE_STATE_DIM=384, ORDER=2, VDEPTH=6]

The important pattern is that the MLP baseline wins in full precision, but the simpler HWNODE variants continue to perform well under quantization. The ORDER=2, VDEPTH=2 HWNODE is currently the strongest compressed model, beating the roughly parameter-matched MLP on final_int6_roundtrip_exact despite having worse online validation before export. Increasing virtual depth from 2 to 6 improves the online validation number, but loses some of that gain after quantization, suggesting a tradeoff between representational power and quantization robustness.

These experiments suggest two complementary conclusions. In RL, HWNODE is strongest as a highly parameter-efficient alternative to an MLP, remaining competitive even under severe compression. In parameter golf, the corrected shared-depth HWNODE is not better than MLPs in full precision, but it appears substantially more robust under int6 export. The virtual-depth-6 results after quantization, however, suggest it may be possible to learn a representation more useful than what quantization robustness gains back. Further experimentation is required.
Conclusion
In this PR I introduce a novel method for generating virtual layers at runtime from a single set of parameters. This architecture has been shown to be at least as expressive as MLPs across a limited variety of RL tests, and more parameter-efficient in extremely compressed scenarios. It has shown promise in language modeling and has much potential for future exploration. Some areas to explore include alternative expansion methods to the Taylor series (escaping the 1/n! limitation), a learnable virtual layer count (dynamic thinking), different activation functions (and/or placing them differently), or a learnable Taylor order (learnable k), among other things.